Evaluating Concrete Strength Model Performance

Using Cross-validation Methods

Sai Devarasheyyt, Mattick, Musson, Perez

2024-07-27

Introduction to Cross-validation

  • Measure performance and generalizability of machine learning and predictive models.
  • Compare different models constructed from the same data set.

Cross-validation (CV) is widely used in various fields, including:

  • Machine Learning
  • Data Mining
  • Bioinformatics

Common uses:

  • Minimize overfitting
  • Ensure a model generalizes to unseen data
  • Tune hyperparameters

Definitions

Generalizability:
How well predictive models created from a sample fit other samples from the same population.

Overfitting:
When a model fits the training data too closely, capturing characteristics specific to that sample rather than its underlying patterns.

Model fits characteristics specific to the training set:

  • Noise
  • Random fluctuations
  • Outliers

Hyperparameters:
Model configuration variables set before training, for example:

  • Nodes and layers in a neural network
  • Branches in a decision tree

Process

Subset the data into K approximately equally sized folds

  • Randomly
  • Without replacement

Split the folds into test and training sets

  • 1 test fold
  • K-1 training folds

  • Fit the model to the training data
  • Apply the fitted model to the test set
  • Measure the prediction error

Repeat K Times

  • Fit the model K times, once for each combination of K-1 training folds
  • Use each fold as the test fold exactly once

Calculate the mean error
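The process above can be sketched from scratch in Python. Everything here is a toy placeholder: the data, the "model" (a mean-only predictor), and the error measure stand in for whatever model and metric you actually use.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, error, k=5, seed=0):
    """Manual k-fold cross-validation following the steps above."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)  # randomly, without replacement
    folds = np.array_split(idx, k)                    # K approximately equal folds
    errors = []
    for i in range(k):                                # repeat K times
        test = folds[i]                               # 1 test fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # K-1 training folds
        model = fit(X[train], y[train])               # fit the model to the training data
        errors.append(error(y[test], predict(model, X[test])))  # error on the test fold
    return float(np.mean(errors))                     # mean error across all folds

# Toy usage: the "model" is just the training mean, the error is MAE.
X = np.arange(20.0).reshape(-1, 1)
y = X.ravel() * 2.0
cv_err = k_fold_cv(X, y,
                   fit=lambda X, y: y.mean(),
                   predict=lambda m, X: np.full(len(X), m),
                   error=lambda yt, yp: np.abs(yt - yp).mean())
```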

Cross-validation Methods

Method Comparison

Method         Computation   Bias           Variance
K-Fold         Lowest        Intermediate   Lower than LOOCV
Nested         Medium        Lowest         Lower than LOOCV
Leave-one-out  Highest       Unbiased       High

  • Shuffle-split
  • Stratified k-fold
  • Leave-p-out
  • Monte Carlo
  • Time series
  • Many others

Which method is right for your:

  • Data type
  • Model type

Study Objectives

The goal of this paper is to explore the methodology of cross-validation and its application in evaluating the performance of predictive models for concrete strength. Concrete strength is a crucial parameter in construction, directly impacting the safety, durability, and cost-effectiveness of structures. Accurate prediction of concrete strength allows for optimal design, better resource allocation, and improved construction practices. Traditional methods of model validation, such as holdout validation, can sometimes provide misleading performance estimates due to their reliance on a single training-validation split. Cross-validation addresses this limitation by using multiple splits, thus providing a more robust evaluation of the model. As previously mentioned, cross-validation is a statistical technique used to assess the generalizability and reliability of a model by partitioning the data into multiple subsets, training the model on some subsets while validating it on others. This process helps prevent overfitting, ensuring that the model performs well on new, unseen data, and provides a more accurate estimate of the model’s performance.


In this study, we apply cross-validation to a dataset containing measurements of concrete strength. We aim to demonstrate how different cross-validation techniques, such as k-fold cross-validation and leave-one-out cross-validation, can be used to evaluate the performance of predictive models. By comparing these techniques, we seek to identify the most effective method for assessing model accuracy and reliability in the context of predicting concrete strength.

Methods

Model Measures of Error

Measuring the quality of fit of a regression model is an important step in data modeling. Several commonly used metrics quantify how well a model explains the data; by measuring the quality of fit, we can select the model that makes the most accurate predictions on unseen data. Common metrics for model performance are:

  • Mean Absolute Error (MAE)

The Mean Absolute Error is a measure of error magnitude. The sign of the error does not matter because MAE uses the absolute value. Smaller MAE values (lower error magnitude) indicate better model fit. MAE is calculated (1) by averaging the absolute differences between the observed \((y_i)\) and predicted \(\hat{f}(x_i)\) values, where:

  • \(n\) is the number of observations,
  • \(\hat{f}(x_i)\) is the prediction that the regression function \(\hat{f}\) gives for the ith observation,
  • \(y_i\) is the observed value.

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{f}(x_i)| \tag{1} \]

  • Root Mean Squared Error (RMSE)

The Root Mean Squared Error (2) is also a measure of error magnitude. Like MAE, smaller RMSE values indicate better model fit. In this method the squared error \((y_i - \hat{f}(x_i))^2\) values are used; squaring gives more weight to larger errors. In contrast, the MAE uses the absolute error \(|y_i - \hat{f}(x_i)|\), so all errors are weighted equally regardless of size. Taking the square root returns the error to the same units as the response variable, making it easier to interpret.

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2} \tag{2} \]

  • R-squared (\(R^2\))

R-squared is the proportion of the variance in the response variable that is explained by the predictor variable(s). Unlike MAE and RMSE, \(R^2\) values range from 0 to 1, and the higher the value, the better the fit. An \(R^2\) value of 0.75 indicates that 75% of the variance in the response variable can be explained by the predictor variable(s). The \(R^2\) equation (3) is composed of two key parts, the Total Sum of Squares (\(SS_{tot}\)) and the Residual Sum of Squares (\(SS_{res}\)).

\[ \text{R}^2 = \frac{SS_{tot}-SS_{res}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \tag{3} \]

(James et al. 2013; Hawkins, Basak, and Mills 2003; Helsel and Hirsch 1993)
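Equations (1)-(3) translate directly into NumPy. This is a minimal sketch with made-up observed and predicted values, not data from the study.

```python
import numpy as np

def mae(y, y_hat):
    # Equation (1): mean of the absolute errors
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    # Equation (2): square root of the mean squared error
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    # Equation (3): 1 - SS_res / SS_tot
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y     = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])
print(mae(y, y_hat))        # 0.5
print(rmse(y, y_hat))       # 0.5
print(r_squared(y, y_hat))  # 0.95
```

Note that MAE and RMSE agree here only because every error has the same magnitude; one large error would raise RMSE above MAE.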

K-Fold Cross-Validation

\[ CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \text{Measure of Error}_i \tag{4} \]

Process:

  1. Prepare the data
Subset the data randomly and without replacement into K equally sized folds. Each fold will contain approximately n/k observations. For example, when n = 200 and K = 5, each fold has 200/5 = 40 observations. If n = 201, one fold would have 41 observations and the other four would have 40.

  2. Split the folds into test and training sets
In the previous example, with 5 folds, we could choose the first fold as the test set and the other 4 as the training set. It makes no difference which fold you choose, as every fold will eventually serve as the test fold against the other 4 ?@fig-kfold.

  3. Fit the model to the training data
    Take the model you are going to use for prediction and fit it to the training data. Continuing with our example, you would use the 4 training folds to fit the model. Then apply the fitted model to the 1 test fold and determine the accuracy by comparing the predictions to the actual values from the test fold.

  4. Repeat steps 2 - 3
    In the example, with K = 5, you would pick a fold you have not previously used as the test fold, with the other 4 as the training folds. In this way, every observation will be a member of the test fold once and of the training folds 4 times.

  5. Calculate the mean error
Measure the error after each fold has been used as the test fold, then take the mean of the error measures across all K folds from step 4.
    (Song, Tang, and Wee 2021)
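The five steps above map directly onto scikit-learn's KFold and cross_val_score. This is a sketch on synthetic data (the concrete dataset is not loaded here), and it assumes scikit-learn is available; the study's own analysis appears to be R-based.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the concrete measurements.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.3]) + rng.normal(scale=0.5, size=200)

# Steps 1-2: random, without-replacement split into K = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Steps 3-5: fit on K-1 folds, score on the held-out fold, repeat K times,
# then average -- equation (4) with MAE as the measure of error.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_absolute_error")
cv_mae = -scores.mean()
```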

Leave-One-Out Cross-Validation (LOOCV)

\[ CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \text{Measure of Error}_i \tag{5} \]

Process: The steps for LOOCV are almost identical to k-fold cross-validation. The only difference is that in k-fold, K must be less than the number of observations (n), whereas in LOOCV, K = n. When you split the data into test and training sets, each test fold is a single observation and the training set is every other observation ?@fig-LOOCV. In this way, every observation is held out and tested against a model trained on all the others, and the process is repeated n times (James et al. 2013).
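In scikit-learn terms, LOOCV is just k-fold with K = n. A minimal sketch on synthetic data, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=50)

loo = LeaveOneOut()          # K = n: each test fold is a single observation
print(loo.get_n_splits(X))   # 50 -- one model fit per observation

scores = cross_val_score(LinearRegression(), X, y,
                         cv=loo, scoring="neg_mean_absolute_error")
cv_n = -scores.mean()        # equation (5): mean error over all n held-out points
```

The n model fits are what make LOOCV the most computationally expensive method in the comparison table above.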

Nested Cross-Validation

  1. Split the data into training and testing sets
As in k-fold cross-validation, break the observations into a single test fold and the training folds. For example, if there are 300 observations and you use K = 5, four of the folds would be training folds and one would be the test fold.

  2. Define inner and outer loops
We define the test fold as the outer loop and use it to test the performance of the model. The training folds are defined as the inner loop, and we use them to decide which parameters to use.

  3. Split the inner loop into training sets and validation sets
The inner loop (the training folds) is split in half: one half is used for training and the other half for validation ?@fig-NestCV.

  4. Fit the model to the inner loop
We choose the number of parameters that we are going to validate and fit the model to the inner-loop training half. After fitting, store the accuracy value for that number of parameters. We then swap the validation and training halves of the inner loop and refit the model. After receiving another accuracy score, average it with the previous accuracy score for that number of parameters.

  5. Choose another number of parameters
    We would then choose a different number of parameters and repeat step 4. After determining the average accuracy for the new set of parameters, we would compare it to the average accuracy produced by the other parameters. The number of parameters that produces the highest average accuracy is chosen for that training fold.

  6. Repeat the process K-times
    After getting an accuracy score for each training fold, we find the average of all folds which will give us the average accuracy of the model. (Berrar et al. 2019).

Analysis and Results

Data Extraction, Transformation and Visualization

(I-C Yeh 1998) modeled the compressive strength of high-performance concrete (HPC) at various ages and made with different ratios of components ?@tbl-data. The data used for the study was made publicly available and can be downloaded from the UCI Machine Learning Repository (I-Cheng Yeh 2007).

Data Exploration and Visualization

  • Target variable:
    • Strength (MPa)
  • Predictor variables:
    • Cement (kg per m³ of mixture)
    • Superplasticizer (kg per m³ of mixture)
    • Age (days)
    • Water (kg per m³ of mixture)

All variables are quantitative

Linear Regression Model

                   Estimate     Std. Error   t value     Pr(>|t|)
(Intercept)        28.2578655   5.1878634     5.446918   1.0e-07
Cement              0.0668433   0.0039668    16.850539   0.0e+00
Superplasticizer    0.8716897   0.0903825     9.644449   0.0e+00
Age                 0.1110466   0.0069538    15.969235   0.0e+00
Water              -0.1195600   0.0257210    -4.648334   3.9e-06

\[ \hat{\text{Strength}} = 28.258 + 0.067\,\text{Cement} + 0.872\,\text{Superplasticizer} + 0.111\,\text{Age} - 0.120\,\text{Water} \]
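As a sketch of how such a model is fit, the snippet below generates hypothetical data using the coefficients from the table above and recovers them with scikit-learn's LinearRegression. The data is synthetic, not the UCI concrete dataset, and the predictor ranges are only loose guesses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the four predictors (NOT the UCI concrete data);
# the ranges loosely mimic Cement, Superplasticizer, Age, and Water.
rng = np.random.default_rng(7)
X = rng.uniform(size=(300, 4)) * np.array([500.0, 30.0, 365.0, 250.0])
true_coef = np.array([0.067, 0.872, 0.111, -0.120])  # from the table above
y = 28.258 + X @ true_coef + rng.normal(scale=2.0, size=300)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # close to 28.258
print(model.coef_)       # close to true_coef
```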

Linear Regression CV Results

  • K-Fold Results:

    Measure of Error   Result
    RMSE               12.13
    MAE                 9.23
    R²                  0.46

  • LOOCV Results:

    Measure of Error   Result
    RMSE               12.13
    MAE                 9.23
    R²                  0.46

  • Nested CV Results:

    Measure of Error   Result
    RMSE               11.87
    MAE                 9.43
    R²                  0.49

LightGBM Model

 

  • Ensemble of decision trees
  • Uses gradient boosting
  • Final prediction is the sum of predictions from all individual trees
  • Provides feature importance scores

LightGBM CV Results

  • K-Fold Results:

    Measure of Error   Result
    RMSE                8.73
    MAE                 6.82
    R²                  0.73

  • LOOCV Results:

    Measure of Error   Result
    RMSE               13.48
    MAE                10.91
    R²                  0.35

  • Nested CV Results:

    Measure of Error   Result
    RMSE                8.27
    MAE                 6.39
    R²                  0.75

Comparison of Models

  • Performance comparison: Linear Regression vs. LightGBM
  • Advantages and disadvantages of each model

    Method   Measure of Error   Linear Regression   LightGBM
    5-Fold   RMSE               12.13                8.73
    5-Fold   MAE                 9.23                6.82
    5-Fold   R²                  0.46                0.73
    LOOCV    RMSE               12.13               13.48
    LOOCV    MAE                 9.23               10.91
    LOOCV    R²                  0.46                0.35
    NCV      RMSE               11.87                8.27
    NCV      MAE                 9.43                6.39
    NCV      R²                  0.49                0.75

Model Comparison K-Fold Plot

Model Comparison LOOCV Plot

Model Comparison Nested CV Plot

LightGBM (Light Gradient Boosting Machine)

  • Description: A gradient boosting framework that uses tree-based learning algorithms.
  • Pros: High efficiency, fast training, and capable of handling large datasets.
  • Cons: Requires careful tuning of parameters.

Predictive Performance Comparison

  • LightGBM outperformed the traditional linear regression model under k-fold and nested cross-validation.
  • Lower prediction errors and more reliable performance metrics.
  • Demonstrated strong generalization capabilities.

Computational Efficiency

  • LightGBM: Fast training and efficient computation.
  • Nested Cross-Validation: Excellent performance but computationally intensive.
  • Efficiency crucial for real-world applications with limited resources.

Conclusion

  • Cross-validation techniques and LightGBM effectively reduce overfitting and enhance model accuracy.
  • LightGBM offers superior accuracy and efficiency.
  • Identified key predictors for accurate model development.
  • Robust framework for model evaluation, improving decision-making in concrete design and construction.

Future Research

  • Further refinement of these techniques to improve predictive accuracy.
  • Exploration of additional advanced models.
  • Application in various engineering contexts to enhance model reliability and performance.

References

All figures were created by the authors.

Berrar, Daniel et al. 2019. “Cross-Validation.”
Hawkins, Douglas M, Subhash C Basak, and Denise Mills. 2003. “Assessing Model Fit by Cross-Validation.” Journal of Chemical Information and Computer Sciences 43 (2): 579–86.
Helsel, Dennis R, and Robert M Hirsch. 1993. Statistical Methods in Water Resources. Elsevier.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Song, Q Chelsea, Chen Tang, and Serena Wee. 2021. “Making Sense of Model Generalizability: A Tutorial on Cross-Validation in R and Shiny.” Advances in Methods and Practices in Psychological Science 4 (1): 2515245920947067.
Yeh, I-C. 1998. “Modeling of Strength of High-Performance Concrete Using Artificial Neural Networks.” Cement and Concrete Research 28 (12): 1797–1808.
Yeh, I-Cheng. 2007. “Concrete Compressive Strength.” UCI Machine Learning Repository.